Learning to recognize webpage genres

Authors

  • Ioannis Kanaris
  • Efstathios Stamatatos
Abstract

Webpages are mainly distinguished by their topic (e.g., politics, sports) and genre (e.g., blogs, homepages, e-shops). Automatic detection of webpage genre could considerably enhance the ability of modern search engines to focus on the user's information need. In this paper, we present an approach to webpage genre detection based on fully automated extraction of the feature set that represents the style of webpages. The features we propose (character n-grams of variable length and HTML tags) are language-independent and easily extracted, and they can be adapted to the properties of the still-evolving web genres and the noisy environment of the web. Experiments on two publicly available corpora show that the proposed approach outperforms previously reported results. They also show that character n-grams are better features than words when the dimensionality increases, and that the binary representation is more effective than the term-frequency representation for both feature types. Moreover, we perform a series of cross-check experiments (e.g., training on one genre palette and testing on a different one, as well as using the features extracted from one corpus to discriminate the genres of the other corpus) to illustrate the robustness of our approach and its ability to capture the general stylistic properties of genre categories even when the feature set is not optimized for the given corpus.
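To make the feature types named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It counts character n-grams over a page's text and counts HTML opening tags, then maps the counts onto a fixed vocabulary as either a binary or a term-frequency vector. The names `TagCollector`, `char_ngrams`, `page_features`, and `to_vector`, and the n-gram range of 3 to 5, are assumptions chosen for illustration. The paper's procedure for selecting variable-length n-grams is not reproduced here.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names and text content of a page (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text_parts.append(data)


def char_ngrams(text, n_min=3, n_max=5):
    """Count every character n-gram with n_min <= n <= n_max (assumed range)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def page_features(html, n_min=3, n_max=5):
    """Return a Counter keyed by ("ngram", s) or ("tag", name)."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so layout noise does not create spurious n-grams.
    text = " ".join(" ".join(parser.text_parts).split())
    features = Counter()
    for gram, count in char_ngrams(text, n_min, n_max).items():
        features[("ngram", gram)] = count
    for tag, count in Counter(parser.tags).items():
        features[("tag", tag)] = count
    return features


def to_vector(features, vocabulary, binary=True):
    """Map features onto a fixed vocabulary: presence (0/1) or raw frequency."""
    if binary:
        return [1 if features.get(f, 0) > 0 else 0 for f in vocabulary]
    return [features.get(f, 0) for f in vocabulary]


if __name__ == "__main__":
    page = "<html><body><p>abc abc</p><p>x</p></body></html>"
    feats = page_features(page)
    vocab = [("ngram", "abc"), ("tag", "p"), ("tag", "table")]
    print(to_vector(feats, vocab, binary=True))
    print(to_vector(feats, vocab, binary=False))
```

The resulting vectors could be passed to any standard classifier. The only difference between the two representations compared in the abstract is the `binary` flag in `to_vector`.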

Similar resources

Clustering Music by Genres Using Supervised and Unsupervised Algorithms

This report describes classification methods that recognize the genres of music using both supervised and unsupervised learning techniques. The five genres, classical (C), EDM (E), hip-hop (H), jazz (J), and rock (R), were examined and classified. As a feature selection method, discrete Fourier transform (DFT) converted the raw wave signals of each song into the signal amplitude ordered by their freq...


Challenges in Creating a Taxonomy for Genres of Digital Documents

Introduction: We report on one phase of a project whose aim is to discover whether and how identifying the genres of digital documents helps in a variety of information-seeking tasks (Crowston & Kwaśnik, 200507). The project has three phases: I. Harvesting and identifying a test-set of webpages from journalists, teachers, and engineers, three groups that share a discourse community in which a s...


Towards a Reappraisal of Literary Competence within the Confines of ESL/EFL Classroom

The present paper aimed at highlighting the judicious incorporation of literary genres (i.e. novel, short story/fiction, drama, and poetry) as a supposedly inspiring teaching technique and an allegedly potent learning resource into ESL/EFL curricula. The rationale behind this pedagogical inclusion is to promote both teaching and learning effectiveness through capitalizing intensively on the gen...


A Novel Architecture for Detecting Phishing Webpages using Cost-based Feature Selection

Phishing is one of the luring techniques used to exploit personal information. A phishing webpage detection system (PWDS) extracts features to determine whether it is a phishing webpage or not. Selecting appropriate features improves the performance of PWDS. Performance criteria are detection accuracy and system response time. The major time consumed by PWDS arises from feature extraction that ...


Journal:
  • Inf. Process. Manage.

Volume 45  Issue 

Pages  -

Publication date 2009